NSF PAR Search | NSF Public Access Repository

Adaptively profiling models with task elicitation

https://doi.org/10.18653/v1/2025.emnlp-main.1270

Brown, Davis; Balehannina, Prithvi; Jin, Helen; Havaldar, Shreya; Hassani, Hamed; Wong, Eric (November 2025, Association for Computational Linguistics)

Language model evaluations often fail to characterize consequential failure modes, forcing experts to inspect outputs and build new benchmarks. We introduce task elicitation, a method that automatically builds new evaluations to profile model behavior. Task elicitation finds hundreds of natural-language tasks—an order of magnitude more than prior work—where frontier models exhibit systematic failures, in domains ranging from forecasting to online harassment. For example, we find that Sonnet 3.5 over-associates quantum computing and AGI and that o3-mini is prone to hallucination when fabrications are repeated in-context.
Full Text Available
Probabilistic Soundness Guarantees in LLM Reasoning Chains

https://doi.org/10.18653/v1/2025.emnlp-main.382

You, Weiqiu; Xue, Anton; Havaldar, Shreya; Rao, Delip; Jin, Helen; Callison-Burch, Chris; Wong, Eric (November 2025, Association for Computational Linguistics)

In reasoning chains generated by large language models (LLMs), initial errors often propagate and undermine the reliability of the final conclusion. Current LLM-based error detection methods often fail to detect propagated errors because earlier errors can corrupt judgments of downstream reasoning. To better detect such errors, we introduce Autoregressive Reasoning Entailment Stability (ARES), a probabilistic framework that evaluates each reasoning step based solely on previously-verified premises. This inductive method yields a nuanced score for each step and provides certified statistical guarantees of its soundness, rather than a brittle binary label. ARES achieves state-of-the-art performance across four benchmarks (72.1% Macro-F1, +8.2 points) and demonstrates superior robustness on very long synthetic reasoning chains, where it excels at detecting propagated errors (90.3% F1, +27.6 points).
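The idea of scoring each step only against previously verified premises can be illustrated with a small sketch. This is not the authors' implementation: the function `stepwise_soundness`, the `entails` callback, the toy rule table, and the sampling scheme are all assumptions made for illustration. Each earlier claim is kept as a premise with probability equal to its own score, and a step's score is the average entailment over the sampled premise sets.

```python
import random
from typing import Callable, Dict, FrozenSet, List, Sequence

# Hypothetical interface: entails(premises, claim) returns a probability in [0, 1].
Entails = Callable[[Sequence[str], str], float]


def stepwise_soundness(
    base: Sequence[str],
    steps: Sequence[str],
    entails: Entails,
    n_samples: int = 200,
    seed: int = 0,
) -> List[float]:
    """Score each step using only premises that survived earlier scoring."""
    rng = random.Random(seed)
    claims = list(base)
    scores = [1.0] * len(base)  # assumption: the given premises are trusted
    step_scores: List[float] = []
    for step in steps:
        total = 0.0
        for _ in range(n_samples):
            # Keep each earlier claim with probability equal to its score.
            kept = [c for c, p in zip(claims, scores) if rng.random() < p]
            total += entails(kept, step)
        score = total / n_samples
        step_scores.append(score)
        claims.append(step)
        scores.append(score)
    return step_scores


# Toy stand-in for an entailment model: a claim holds only if all of its
# required premises are present. These rules are invented for the example.
RULES: Dict[str, FrozenSet[str]] = {
    "x + y = 5": frozenset({"x = 2", "y = 3"}),
    "x * y + 1 = 8": frozenset({"x * y = 7"}),
}


def toy_entails(premises: Sequence[str], claim: str) -> float:
    required = RULES.get(claim)
    if required is None:
        return 0.0  # no rule supports this claim
    return 1.0 if required.issubset(premises) else 0.0


base = ["x = 2", "y = 3"]
steps = ["x + y = 5", "x * y = 7", "x * y + 1 = 8"]
scores = stepwise_soundness(base, steps, toy_entails)
print(scores)
```

In this toy chain the last step follows from the second one, but the second step is unsupported, so it is never sampled as a premise and the last step is scored as unsound as well. In a realistic setting the `entails` callback would be a learned or LLM-based entailment judge rather than a lookup table.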
Full Text Available
Sum-of-Parts: Self-Attributing Neural Networks with End-to-End Learning of Feature Groups

You, Weiqiu; Xue, Anton; Havaldar, Shreya; Rao, Delip; Jin, Helen; Callison-Burch, Chris; Wong, Eric (July 2025, PMLR)

Self-attributing neural networks (SANNs) present a potential path towards interpretable models for high-dimensional problems, but often face significant trade-offs in performance. In this work, we formally prove a lower bound on errors of per-feature SANNs, whereas group-based SANNs can achieve zero error and thus high performance. Motivated by these insights, we propose Sum-of-Parts (SOP), a framework that transforms any differentiable model into a group-based SANN, where feature groups are learned end-to-end without group supervision. SOP achieves state-of-the-art performance for SANNs on vision and language tasks, and we validate that the groups are interpretable on a range of quantitative and semantic metrics. We further validate the utility of SOP explanations in model debugging and cosmological scientific discovery.
Full Text Available
Dolphin: A Programmable Framework for Scalable Neurosymbolic Learning

Naik, Aaditya; Liu, Jason; Wang, Claire; Sethi, Amish; Dutta, Saikat; Naik, Mayur; Wong, Eric (July 2025, ICML 2025)

Full Text Available
Jailbreaking Black Box Large Language Models in Twenty Queries

https://doi.org/10.1109/SaTML64287.2025.00010

Chao, Patrick; Robey, Alexander; Dobriban, Edgar; Hassani, Hamed; Pappas, George J; Wong, Eric (April 2025, IEEE)

Full Text Available
The FIX Benchmark: Extracting Features Interpretable to eXperts

Jin, Helen; Havaldar, Shreya; Kim, Chaehyeon; Xue, Anton; You, Weiqiu; Qu, Helen; Gatti, Marco; Hashimoto, Daniel A; Jain, Bhuvnesh; Madani, Amin; et al (June 2025, Journal of Data-centric Machine Learning Research)

Feature-based methods are commonly used to explain model predictions, but these methods often implicitly assume that interpretable features are readily available. However, this is often not the case for high-dimensional data, and it can be hard even for domain experts to mathematically specify which features are important. Can we instead automatically extract collections or groups of features that are aligned with expert knowledge? To address this gap, we present FIX (Features Interpretable to eXperts), a benchmark for measuring how well a collection of features aligns with expert knowledge. In collaboration with domain experts, we propose FIXScore, a unified expert alignment measure applicable to diverse real-world settings across cosmology, psychology, and medicine domains in vision, language, and time series data modalities. With FIXScore, we find that popular feature-based explanation methods have poor alignment with expert-specified knowledge, highlighting the need for new methods that can better identify features interpretable to experts.
Full Text Available
Observation of odd-parity superconductivity in UTe2

https://doi.org/10.1073/pnas.2419734122

Li, Zixuan; Moir, Camilla M; McKee, Nathan J; Lee-Wong, Eric; Baumbach, Ryan E; Maple, M Brian; Liu, Ying (February 2025, Proceedings of the National Academy of Sciences)

Symmetry properties of the order parameter are among the most fundamental characteristics of a superconductor. UTe₂, which was found to feature an exceedingly large upper critical field and striking reentrant behavior at low temperatures, is widely believed to possess a spin-triplet pairing symmetry. However, unambiguous evidence for such a pairing symmetry is still lacking, especially at zero and low magnetic fields. The presence of an inversion crystalline symmetry in UTe₂ requires that, if it is indeed a spin-triplet superconductor, the order parameter must be of odd parity. We report here phase-sensitive measurements of the symmetry of the orbital part of the order parameter using the Josephson effect. The selection rule in the orientation dependence of the Josephson coupling between In, an s-wave superconductor, and UTe₂ suggests strongly that UTe₂ possesses the odd-parity pairing state of B₁ᵤ symmetry near zero magnetic field, making it a spin-triplet superconductor. We also report the apparent formation of Andreev surface bound states on the (1−10) surface of UTe₂.
Full Text Available
Observation of odd-parity superconductivity in UTe2

Li, Zixuan; Moir, Camilla M; McKee, Nathan J; Lee-Wong, Eric; Baumbach, Ryan E; Maple, M Brian; Liu, Ying (February 2025, Proceedings of the National Academy of Sciences)

Full Text Available
Data-Efficient Learning with Neural Programs

Solko-Breslin, Alaia; Choi, Seewon; Li, Ziyang; Velingker, Neelay; Alur, Rajeev; Naik, Mayur; Wong, Eric (December 2024, Neural Information Processing Systems Foundation (NIPS Foundation))

Many computational tasks can be naturally expressed as a composition of a DNN followed by a program written in a traditional programming language or an API call to an LLM. We call such composites "neural programs" and focus on the problem of learning the DNN parameters when the training data consist of end-to-end input-output labels for the composite. When the program is written in a differentiable logic programming language, techniques from neurosymbolic learning are applicable, but in general, the learning for neural programs requires estimating the gradients of black-box components. We present an algorithm for learning neural programs, called ISED, that only relies on input-output samples of black-box components. For evaluation, we introduce new benchmarks that involve calls to modern LLMs such as GPT-4 and also consider benchmarks from the neurosymbolic learning literature. Our evaluation shows that for the latter benchmarks, ISED has comparable performance to state-of-the-art neurosymbolic frameworks. For the former, we use adaptations of prior work on gradient approximations of black-box components as a baseline, and show that ISED achieves comparable accuracy but in a more data- and sample-efficient manner.
Full Text Available
JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models

Chao, Patrick; Debenedetti, Edoardo; Robey, Alexander; Andriushchenko, Maksym; Croce, Francesco; Sehwag, Vikash; Dobriban, Edgar; Flammarion, Nicolas; Pappas, George J; Tramer, Florian; et al (December 2024, NeurIPS 2024 Datasets and Benchmarks Track)

Full Text Available
